Reporting Period

Program  Objectives 

To develop practical, low cost, real time methods for
suppressing noise which has been acoustically added to
speech.

To demonstrate that through the incorporation of the
noise suppression methods, speech can be effectively
analysed for narrow band digital transmission in practical
operating environments.

Summary  of  Tasks  and  Results 

Introduction

This semi-annual technical report describes the current
status in five research areas for the period 1 October 1977
through 31 March 1978.


Steven  F.  Boll 



A stand alone noise suppression algorithm is described
for reducing the spectral effects of acoustically added
noise in speech. A fundamental result is developed which
shows that the spectral magnitude of speech plus noise can
be effectively approximated as the sum of magnitudes of
speech and noise. Using this simple phase independent
additive model, the noise bias present in the short time
spectrum is reduced by subtracting off the expected noise
spectrum calculated during nonspeech activity. After bias
removal, the time waveform is recalculated from the modified
magnitude and saved phase. This Spectral Averaging for Bias
Estimation and Removal, or SABER, method requires only one
FFT per time window for analysis and synthesis.


A SUMMARY  OF  RECENT  EXPERIMENTS 


APPLYING  ADAPTIVE  NOISE  CANCELLATION  TECHNIQUES 
TO  AUDIO  SIGNALS 



Dennis  Pulsipher 


A dual input noise cancellation technique for audio
signals was presented in a semi-annual report a year ago.
The philosophy behind the technique was quite different from
that of traditional techniques. Instead of estimating the
desired signal directly, the technique attempted to estimate
the noise directly and obtained a signal estimate by
subtracting the noise estimate from the noisy signal.

The experiments which had been performed at that time
used synthetic data and demonstrated great potential for the
technique. In the last semi-annual report initial
experiments in a real environment were described. A
description of experiments that have followed and some of
the questions they have raised comprises the body of this
report.

During these experiments it became obvious that many
facets of the noise cancellation problem are yet to be
understood. Techniques dealing with filter inversion are
being investigated to better understand the problems
involved. Investigations are also underway to improve
convergence of channel estimates when frequency bands of low
energy are contained in the reference noise samples. Even
if these investigations are fruitless, however, noise
cancellation now appears to be a worthwhile approach to
signal restoration in acoustically hostile environments.


Estimation of the Parameters


of  an  Autoregressive  Moving-Average  Process 
In  the  Presence  of  Noise 


William  J.  Done 


The previous report on this project presented the
details for the autoregressive moving-average (ARMA) process
generated by adding white noise to an autoregressive (AR)
process. That report stressed the problems that are
inherent in estimating the parameters of the resulting ARMA
process. Part of this estimation problem lies in the
validity of this model for a given application. A major
part of the difficulty, however, lies in developing
estimation procedures for ARMA processes, regardless of the
source of that process. The primary effort in this project
since the last report has been the investigation of various
methods that might be used to estimate the autoregressive
and moving-average coefficients of an ARMA process from data
generated by that process. Three methods have been
implemented for evaluation.


Nonparametric-Rank Order Statistics Applications


To Robust Speech Activity Detection


Benjamin V. Cox


This report describes a theoretical and experimental
investigation for detecting the presence of speech in
wideband noise. An algorithm for making the silence-speech
decision is described. This algorithm is based on a
nonparametric statistical signal-detection scheme that does
not require a training set of data and maintains a constant
false alarm rate for a broad class of noise inputs. The
nonparametric decision procedure is the multiple use of the
two-sample Savage T statistic. The performance of this
detector is evaluated and compared to that obtained from
manually classifying seven recorded utterances with 40, 30,
20, 10, and 0 dB signal-to-noise ratios. In limited
testing, the average probability of misclassification is
less than 6%, 12% and 46% for signal-to-noise ratios of 30,
20, and 0 dB respectively.
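The summary above names the statistic but does not spell it out. As a minimal sketch, assuming the standard textbook form of the two-sample Savage statistic (the sum, over the pooled ranks of the test sample, of the scores a_N(i) = 1/N + 1/(N-1) + ... + 1/(N-i+1)) and assuming no tied values, one evaluation might look like this. The function names and the idea of comparing a test frame against a noise-only reference frame are illustrative assumptions, not details taken from the report.

```python
def savage_scores(total):
    """Savage scores a_N(i) = sum_{j=N-i+1}^{N} 1/j for ranks i = 1..N."""
    scores = []
    for rank in range(1, total + 1):
        scores.append(sum(1.0 / j for j in range(total - rank + 1, total + 1)))
    return scores


def savage_statistic(reference, test):
    """Sum of Savage scores at the pooled ranks of the test sample.

    Assumes no ties between values (hypothetical simplification).
    """
    pooled = sorted([(v, 0) for v in reference] + [(v, 1) for v in test])
    scores = savage_scores(len(pooled))
    return sum(scores[i] for i, (_, label) in enumerate(pooled) if label == 1)
```

Under this definition the scores average to one, so a statistic well above the test-sample size suggests that the test values tend to rank higher than the reference values. How the report combines several such tests ("multiple use") is not specified here.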


The Constant-Q Transform


Jim  Kajiya 

A generalization of the short-time Fourier transform is
presented which performs constant-percentage bandwidth
analysis of time-domain signals. The transform is shown to
exhibit frequency-dependent time and frequency resolution.
A synthesis transform is also developed which provides an
analysis-synthesis system which is an identity in the
absence of spectral modification (given a mild analysis
window constraint).


ABSTRACT 


A stand alone noise suppression algorithm is described
for reducing the spectral effects of acoustically added
noise in speech. A fundamental result is developed which
shows that the spectral magnitude of speech plus noise can
be effectively approximated as the sum of magnitudes of
speech and noise. Using this simple phase independent
additive model, the noise bias present in the short time
spectrum is reduced by subtracting off the expected noise
spectrum calculated during nonspeech activity. After bias
removal, the time waveform is recalculated from the modified
magnitude and saved phase. This Spectral Averaging for Bias
Estimation and Removal, or SABER, method requires only one
FFT per time window for analysis and synthesis.


Summary


Background

The majority of narrow-band speech compression
algorithms were designed and tested based upon noise-free
speech as input. However, the systems constructed from
these algorithms will be used in both quiet and noisy
environments. For noisy environments such as the
helicopter cockpit, the intelligibility and quality of
transmitted compressed speech must be maintained at an
acceptable level. Methods available to suppress noise in
actual operating environments include modifying the speech
compression system, use of noise cancelling microphones, or
the insertion of a preprocessing noise suppression system
prior to vocoder input. This paper describes a
preprocessing noise suppression algorithm. This approach
was chosen since one, the vocoder system is not modified;
two, the noise suppression algorithm is independent of
any specific vocoder implementation; three, most noise
cancelling microphones do not generally remove noise above
about 1 kHz [1]; and four, the method proposed is
straightforward to implement and can run in real time.
Below are summarized the objectives, approach, and results
of this technique.
Objectives


Develop a model for characterizing the spectral effects
of additive noise on speech. Ensure that the model is
applicable simultaneously to both narrow-band periodic noise
and wide band colored noise. Minimize the number of a priori
assumptions needed to justify the model or implement the
algorithm based on the model. Ensure that in the absence of
noise the algorithm reduces to essentially an identity
system.

Design and implement a noise suppression algorithm
based on the model having digital speech in and digital
speech out. To afford low cost, real time implementation,
keep the implementation as simple as possible, use
straightforward estimation techniques and minimize the
amount of external information required for effective
implementation.

Test the algorithm on speech obtained in realistic
operating environments. The speech should be corrupted with
noise generated by the environment and acoustically added to
the speech. The tests should measure improvements in both
intelligibility and quality by comparing results with and
without noise suppression.

Tandem the algorithm with a representative
processor. Retest synthetic speech for
intelligibility and quality with and without noise
suppression preprocessing.

Specify the advantages, limitations and requirements
needed for a real time implementation.


Algorithm Description


The following assumptions were used in implementing the
algorithm. The background noise is acoustically or
digitally added to the speech. The background noise
environment remains locally stationary to the degree that
its spectral magnitude expected value just prior to speech
activity equals its expected value during speech activity.
If the environment changes to a new stationary state, there
exists enough time (about 300 ms) to estimate a new
background noise spectral magnitude expected value before
speech activity commences. For the slowly varying
non-stationary noise environment, the algorithm requires a
speech activity detector to signal the program that speech
has ceased and a new noise bias can be estimated. Finally,
it is assumed that significant noise reduction is possible
by removing the effect of noise from the magnitude spectrum
only.


Basis for analysis. A fundamental property is developed
which demonstrates that the spectral magnitude of noisy
speech can be effectively modeled as the sum of magnitudes
of speech and noise. As such the additive noise exhibits
itself as possibly a wide variance bias added to the desired
speech spectrum. Therefore an estimate of the speech
magnitude spectrum is obtained by subtracting off an
estimate of the noise bias. If the noise has primarily a
wide variance non-deterministic component, then local
averaging of magnitude spectra is used to reduce the noise
variance. If the noise is primarily narrow variance then no
averaging is required for variance reduction prior to bias
removal.


Method. Speech is analyzed by windowing data from
half-overlapped input data buffers. The magnitude and phase
spectra of the windowed data are calculated and the phase is
saved. Magnitudes from adjacent windows are then averaged
and the spectral noise bias calculated during non-speech
activity is subtracted off. Resulting negative amplitudes
are then either rectified or zeroed out. A time waveform is
recalculated from the modified magnitude and saved phase.
This waveform is then overlap added to the previous data to
generate the output speech.

Advantages and Limitations. The method requires only a
single microphone. It is applicable to both wide-band and
narrow-band noise sources. The method is computationally
efficient, requiring only one FFT per analysis frame with the
FFT computation per frame increasing logarithmically with
the sampling rate. Finally, the algorithm output is speech
and thus can be tandemed to any narrow-band speech
processor.

Limitations of the algorithm include the requirement of
a locally stationary noise environment and possibly a speech
activity detector for updating the noise bias estimate
following a spectral noise shift. If the noise is
non-coherent, then the averaging required for variance
reduction will produce some temporal echo-like smearing. In
addition, as will be shown, the spectral estimation error is
proportional to the amount of variance reduction.
Therefore, only partial noise cancellation is possible for
wide variance noise sources.

Results. The performance of the SABER algorithm has been
initially measured using a limited Diagnostic Rhyme Test
(DRT). Testing was conducted by Dynastat, Inc. [2] using
clear channel and helicopter noise tapes. Measures for
improvements in intelligibility as well as a coarse measure
of quality were conducted using a single speaker test.
Results indicate average improvements in intelligibility
with some subareas having major improvements and major
improvements in quality. Detailed scores are given below in
the section on results.
Algorithm Implementation

Input-Output Data Manipulation

Speech from the A-D converter is segmented and windowed
such that in the absence of spectral modifications when the
synthesis speech segments are added together, the resulting
overall system reduces to an identity. The data is
segmented and windowed using the result [3] that if a
sequence is separated into half-overlapped data buffers, and
each buffer is multiplied by a Hanning window, then the sum
of these windowed sequences adds back up to the original
sequence. The window length is chosen to be approximately
twice as large as the maximum expected pitch period for
adequate frequency resolution [4]. For the sampling rate of
8.00 kHz a window length of 256 points shifted in steps of
128 points was used. Figure 1 shows the data segmentation
and advance.
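The overlap-add identity can be checked numerically. The sketch below is only an illustration, assuming the "periodic" form of the Hanning window, w[n] = 0.5 - 0.5 cos(2 pi n / L), for which w[n] + w[n + L/2] = 1 exactly; the function names and the test signal are hypothetical.

```python
import math


def hann(length):
    """Periodic Hanning window: w[n] = 0.5 - 0.5*cos(2*pi*n/length)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def overlap_add_identity(signal, length=256):
    """Window half-overlapped buffers and add them back together."""
    hop = length // 2
    window = hann(length)
    output = [0.0] * len(signal)
    for start in range(0, len(signal) - length + 1, hop):
        for n in range(length):
            output[start + n] += window[n] * signal[start + n]
    return output


# Hypothetical test signal: 1024 samples of a slow sinusoid.
signal = [math.sin(0.05 * n) for n in range(1024)]
rebuilt = overlap_add_identity(signal)

# Away from the first and last half-window every sample is covered by
# two windows whose weights sum to one, so the input is recovered.
interior_error = max(abs(rebuilt[n] - signal[n]) for n in range(128, 1024 - 128))
```

Only the first and last half-window differ from the input, since those samples are covered by a single window.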


Frequency Analysis


The DFT of each data window is taken and converted to
the polar coordinates of magnitude and phase.

Since real data is being transformed, two data windows
can be transformed using one FFT [5]. The FFT size is set
equal to the window size of 256. Augmentation with zeros
was not incorporated. As correctly noted by J. Allen [6],
spectral modification followed by inverse transforming can
distort the time waveform due to temporal aliasing caused
by circular convolution with the time response of the
modification. Augmenting the input time waveform with zeros
before spectral modification will minimize this aliasing.
Experiments with and without augmentation using the
helicopter speech resulted in negligible differences and
therefore augmentation was not incorporated. Finally, since
real data is analyzed, transform symmetries were taken
advantage of to reduce storage requirements essentially in
half.
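The two-windows-per-transform idea rests on the conjugate symmetry of the DFT of real data. As a hedged sketch (a naive O(N^2) DFT stands in for the FFT so the example stays self-contained, and all names are illustrative), two real buffers a and b can be packed as a + jb, transformed once, and separated afterwards:

```python
import cmath


def dft(x):
    """Naive DFT, used here only as a stand-in for an FFT routine."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def two_real_dfts(a, b):
    """Spectra of two real sequences from one complex transform."""
    n = len(a)
    z = dft([complex(a[m], b[m]) for m in range(n)])
    spec_a, spec_b = [], []
    for k in range(n):
        mirror = z[(n - k) % n].conjugate()
        spec_a.append((z[k] + mirror) / 2)    # even (conjugate-symmetric) part
        spec_b.append((z[k] - mirror) / 2j)   # odd part, rotated back
    return spec_a, spec_b


def to_polar(spectrum):
    """Convert complex bins to (magnitude, phase) pairs."""
    return [cmath.polar(value) for value in spectrum]
```

Because the spectrum of real data satisfies X[N-k] = conj(X[k]), only bins 0 through N/2 need to be stored, which is the storage saving mentioned above.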


Magnitude Averaging

As is shown below, the variance of the noise spectral
estimate is reduced by averaging over as many spectral
magnitude sets as possible. However, the non-stationarity
of the speech limits the total time interval available for
local averaging. The number of averages is limited by the
number of analysis windows which can be fit into the
stationary speech time interval. The choice of window
length and averaging interval must compromise between
conflicting requirements. For acceptable spectral
resolution a window length greater than twice the expected
largest pitch period is required, with a 256 point window
being used. For minimum noise variance a large number of
windows are required for averaging. Finally, for acceptable
time resolution a narrow analysis interval is required. A
reasonable compromise between variance reduction and time
resolution appears to be three averages. This results in an
effective analysis time interval of 38 ms.
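A minimal sketch of the local averaging step, assuming each frame is a list of magnitude values (one per frequency bin) and that each output frame is the plain mean of a center frame and its immediate neighbors. The function name and the treatment of the edges (frames without a full set of neighbors are dropped) are illustrative choices, not taken from the report.

```python
def average_magnitudes(frames, span=3):
    """Average each magnitude frame with its neighbors, bin by bin."""
    half = span // 2
    averaged = []
    for i in range(half, len(frames) - half):
        group = frames[i - half:i + half + 1]
        # zip(*group) walks the frames column by column (per frequency bin).
        averaged.append([sum(column) / span for column in zip(*group)])
    return averaged
```

With span=3 this corresponds to the three-average compromise described above; the phase of the center frame would be the one saved for synthesis.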


Bias Estimation

The SABER method requires an estimate at each frequency
bin of the expected value of the noise magnitude spectrum,

$$\mu_N = E\{|N|\}$$

This estimate is obtained by averaging the signal magnitude
spectrum $|X|$ during non-speech activity. Estimating $\mu_N$ in
this manner places certain constraints when implementing the
method. If the noise remains stationary during the
subsequent speech activity, then an initial startup or
calibration period of noise-only signal is required. During
this period (on the order of a third of a second) an
estimate of $\mu_N$ can be computed. If the noise environment
is nonstationary then a new estimate of $\mu_N$ must be
calculated prior to bias removal each time the noise
spectrum changes. Since the estimate is computed using the
noise-only signal during non-speech activity, a voice switch
is required. When the voice switch is off an average noise
spectrum can be recomputed. If the noise magnitude spectrum
is changing faster than an estimate of it can be computed,
then time averaging to estimate $\mu_N$ cannot be used.
Likewise if the expected value of the noise spectrum changes
after an estimate of it has been computed, then noise
reduction through bias removal will be less effective or
even harmful.
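As an illustration of the estimate, the sketch below averages magnitude frames bin by bin over the frames that a voice switch marks as non-speech. The function name and the per-frame boolean flags are assumptions made for the example; the report does not describe the voice switch interface.

```python
def estimate_noise_bias(magnitude_frames, speech_flags):
    """Mean magnitude per frequency bin over frames flagged as non-speech."""
    noise_frames = [frame for frame, is_speech
                    in zip(magnitude_frames, speech_flags) if not is_speech]
    if not noise_frames:
        raise ValueError("no noise-only frames available for the estimate")
    count = len(noise_frames)
    return [sum(column) / count for column in zip(*noise_frames)]
```

At 8 kHz with a 128-point advance, a third of a second of noise-only input supplies roughly 20 frames for this average.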


Bias Removal


The SABER spectral estimate $\hat{S}$ is obtained by
subtracting the expected noise magnitude spectrum $\mu_N$ from
the averaged magnitude signal spectrum $\overline{|X|}$.
Thus:

$$\hat{S}(k) = \overline{|X(k)|} - \mu_N(k), \qquad k = 0, 1, \ldots, L-1$$

where $L$ = DFT buffer length.


After subtracting, the differenced values having
negative magnitudes can either be set to zero (half
rectification) or be made positive (full rectification).
These negative differences represent frequencies where the
sum of speech plus local noise is less than the expected
noise. As referenced below, full-wave rectification
minimizes the spectral error. However, if the noise source
drops out during speech, full rectification will result in
the expected noise value being incorrectly added back in to
the speech spectrum. This in fact happened for the
helicopter tapes processed. Therefore half rectification
was used. Figures 2 and 3 show input-output frequency
relations for half and full rectification.
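The subtraction and the two rectification options can be sketched as follows. This is a minimal illustration with hypothetical names; the inputs are lists of magnitudes, one per frequency bin.

```python
def remove_bias(averaged_magnitude, noise_bias, full_rectify=False):
    """Subtract the noise bias per bin, then rectify negative differences."""
    estimate = []
    for x_mag, mu in zip(averaged_magnitude, noise_bias):
        difference = x_mag - mu
        if difference < 0.0:
            # Half rectification zeroes the bin; full rectification flips it.
            difference = -difference if full_rectify else 0.0
        estimate.append(difference)
    return estimate
```

For example, with a bias of 2.0 in every bin, the input magnitudes [5.0, 1.0, 0.0] give [3.0, 0.0, 0.0] under half rectification and [3.0, 1.0, 2.0] under full rectification. The last bin shows the drop-out problem noted above: the input is silent, yet full rectification returns the expected noise value.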


Figure 2. Input-Output Relations


Synthesis 

After  bias  removal  and  rectification,  a time  waveform 
is  reconstructed  from  the  modified  magnitude  and  the  phase 
buffer corresponding to the center window. Again since only
real data is generated, two time data sets are computed
simultaneously using one inverse FFT. The data windows are
Results


The ability of this method to improve intelligibility
is being measured using the Diagnostic Rhyme Test (DRT) [2].
A measure of quality improvement is also available using the
DRT data base [7]. This section lists preliminary results
for a limited DRT test using a single speaker. The data,
provided by Dynastat, Inc., consisted of speaker RH recorded
in a helicopter environment. The results are given in
Tables 1 and 2. Table 1 lists intelligibility scores for the
original data and the SABER output, followed by
intelligibility scores for an LPC vocoder output which used
original or SABER as input. Table 2 lists quality scores of
original and SABER followed by quality scores of LPC output
using either original or SABER as input.


              Original   SABER   LPC on     LPC on
                                 Original   SABER

Voicing          95        91       84        86
Nasality         82        77       56        52
Sustension       92        86       49        56
Sibilation       75        84       61        88
Graveness        68        66       61        59
Compactness      88        88       83        93

Total            84        82       66        72

Table 1

DRT Scores for Single Speaker

                      Original   SABER   LPC on     LPC on
                                         Original   SABER

Naturalness              49        47       40        41
Inconspicuousness
  of Background          30        41       29        38
Intelligibility          31        30       22        26
Pleasantness             16        26       13        23
Acceptability            28        32       22        28

Total                    25        29       19        25

Table 2

Quality Scores from DRT Data


Observations 

This  single  speaker  DRT  test  indicates  that  SABER 
processing  followed  by  LPC  significantly  increases 
intelligibility.  Scores  in  the  areas  of  voicing,  nasality 
and  graveness  are  about  equal.  It  improves  the 
apprehensibility  of  sustension,  sibilation,  and  compactness. 

The quality measures taken clearly indicate that SABER
enhances  listener  acceptability.  The  background  noise  is 
less  conspicuous,  and  the  processed  speech  more  pleasant. 
Analysis of the Phase Independent Model


Assume that a noise signal $n$ has been added to a
speech signal $s$, with their sum denoted as $x$.
Then

$$x = s + n$$

Taking the Fourier transform gives

$$X = S + N$$

The desired speech spectral magnitude $|S|$ is given by

$$|S| = |X - N|$$

The zero phase approximation $\hat{S}_Z$ to $|S|$ is given by

$$\hat{S}_Z = |X| - |N|$$

When $\hat{S}_Z$ goes negative it can be half-rectified, $\hat{S}_H$,
or full-rectified, $\hat{S}_F$:

$$\hat{S}_H = \frac{\hat{S}_Z + |\hat{S}_Z|}{2}, \qquad \hat{S}_F = |\hat{S}_Z|$$

The spectral error $D$ at any frequency is given by

$$D_F = |S| - \hat{S}_F, \qquad D_H = |S| - \frac{\hat{S}_Z + |\hat{S}_Z|}{2}$$
It can be shown [8] that the full-rectified modeling
error is zero for $|N| > |S|$ and the relative error $D/|S|$ is
inversely proportional to the signal to noise ratio for
$|S| > |N|$. For $|X| > |N|$ the half-rectified modeling error will
increase to as much as $|S|$. However, if the noise floor
were to suddenly decrease well below its average value, the
full-rectified estimate would incorrectly add noise into the
estimate whereas the half-rectified estimate would not.
Thus the half-rectified estimate would give better results
in this situation.


Analysis and Reduction of Estimation Error

Error Estimate

Using the zero phase model (assuming $|X| > |N|$ for
simplicity) the SABER estimation error is given by

$$\epsilon = \hat{S} - \hat{S}_Z = |X| - \mu_N - |X| + |N|$$

where

$\hat{S} = |X| - \mu_N$ equals the unaveraged SABER estimate,

$\mu_N = E\{|N|\}$ equals the expected noise magnitude
spectrum,

$\hat{S}_Z = |X| - |N|$ equals the zero phase estimate of $|S|$.

Thus the spectral error $\epsilon$ equals $|N| - \mu_N$, the
difference between the magnitude of the noise spectrum and
its expected value.


Averaging


The spectral error can be reduced by averaging the
magnitude spectra $|X|$. The amount of reduction by
averaging has been carefully investigated [8], [9]. For
example, if five half-overlapped windows are used [8]:

$$E\{\epsilon^2\} = 0.275\,\mathrm{var}\{|N|\} = (0.06)\,\sigma_N^2$$

This gives a total variance reduction of -12.4 dB.
A SUMMARY OF RECENT EXPERIMENTS


APPLYING  ADAPTIVE  NOISE  CANCELLATION  TECHNIQUES 
TO  AUDIO  SIGNALS 


Dennis  Pulsipher 


Introduction


A dual input noise cancellation technique for
audio signals was presented in a semi-annual report a
year ago. The philosophy behind the technique was
quite different from that of traditional techniques.
Instead of estimating the desired signal directly, the
technique attempted to estimate the noise directly and
obtained a signal estimate by subtracting the noise
estimate from the noisy signal.

The experiments which had been performed at that
time used synthetic data and demonstrated great
potential for the technique. In the last semi-annual
report initial experiments in a real environment were
described. A description of experiments that have
followed and some of the questions they have raised
comprises the body of this report.
The Experiments

Upon  successful  completion  of  the  synthetic 
results  described  in  previous  reports,  experiments  to 
evaluate  real  situations  were  begun.  An  attempt  was 
made  to  design  these  experiments  so  that  real  acoustic 
situations  were  used,  without  completely  destroying  the 
validity  of  the  assumed  data  generation  model. 

Efforts to maintain a certain amount of
consistency between experiments resulted in a set of
control conditions which were maintained constant for
all recent experiments. To minimize recording effects,
it was decided to digitally record the noisy signal and
the noise reference signal simultaneously. All signals
were low-pass filtered to a bandwidth of 3.2 kHz and
sampled at a rate of 6.67 kHz. Control of the
environment was maintained by recording in a single,
isolated, but acoustically live room. While no effort
was made to simulate a point noise source, the noise
was generated at the side of the room by a single, high
quality speaker system, which was kept in a fixed
position.

The initial experiments performed used two


channel and the noise reference channel. By using slow
adaptation rates (time constants of approximately 5
seconds) and long transversal filter lengths (3000
points), noise reduction of approximately 16 dB was
achieved. Doubling the length of the filter resulted
in about 1 dB improvement over that level.
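The report does not name the adaptation rule. As a minimal sketch, assuming the widely used LMS update for an adaptive transversal filter, a dual input canceller filters the reference noise to estimate the noise in the primary channel and subtracts it. The filter length, step size, and the three-tap "room" channel below are toy values chosen for illustration, far smaller than the 3000-point filters described above.

```python
import random


def lms_noise_canceller(noisy, reference, taps=8, step=0.01):
    """Adaptive transversal filter with an LMS update (assumed rule)."""
    weights = [0.0] * taps
    history = [0.0] * taps          # most recent reference sample first
    output = []
    for primary, ref in zip(noisy, reference):
        history = [ref] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = primary - noise_estimate      # this is the signal estimate
        weights = [w + 2.0 * step * error * h for w, h in zip(weights, history)]
        output.append(error)
    return output, weights


# Toy demonstration with noise only in the primary channel (no speech).
random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(5000)]
channel = [0.6, -0.3, 0.1]          # hypothetical path from source to microphone
noisy = [sum(c * reference[n - k] for k, c in enumerate(channel) if n - k >= 0)
         for n in range(len(reference))]
cleaned, weights = lms_noise_canceller(noisy, reference)
```

In this idealized setting the reference is the noise source itself, so no channel has to be inverted and the weights simply converge toward the channel taps. With speech present in the primary channel, the error output would be the speech estimate.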

Since synthetic experiments had yielded
significantly better results, questions were raised
about the validity of treating a room as a linear
channel, whether or not small movements in the room
affected stationarity assumptions, if the lack of a
point noise source was a serious complication, or if
something about the channels themselves was the cause
of the degradation.

To identify the sources of degradation another
series of experiments was undertaken. Empirical
estimates of the impulse response of the room from a
single source to two separate points in the room were
made. A known digitized noise source was then
digitally filtered through the two different impulse
responses measured. These filtered noise samples were
then used as the noisy signal and reference noise
inputs to the noise cancellation algorithm. Thus, the
acoustically recorded experiments were simulated with
similar channels to those expected, but wherein
linearity and stationarity were guaranteed. These
experiments yielded results roughly 2 dB better than
the corresponding experiments which used acoustically
produced data. Differences in microphone placement and
lack of additional uncorrelated noise at low levels
were considered capable of causing such minor
differences, and it was concluded that assumptions of
both stationarity and linearity of the channels were
probably justified. It was also concluded that the
lack of a point source was not a serious problem.

At  this  point  it  was  strongly  suspected  that  the 
fact  that  one  of  the  channels  had  to  be  effectively 
inverted  was  the  cause  of  the  degradation.  Many 
theoretical  and  practical  issues  regarding  inverse 
filtering  were  considered  and  it  was  decided  to  devise 
a quick experiment to verify this suspicion.

Since the major obstacle to great success with
acoustic data appeared to be involved with inverting
one of the room's channels, it was decided to see if
the problem could be eliminated by forcing that channel
to be an identity system, which could be trivially
inverted. The noisy signal was therefore recorded as
before, with a microphone placed in the middle of the
room. The reference noise, however, was not recorded
through a microphone at all, but directly from the
electrical signal used to drive the speaker system.
This configuration achieved noise reduction of
approximately 26 dB, confirming suspicions that
effective channel inversion was a major problem.

It was then wondered if careful placement of the
noise reference pick-up microphone might be used to
improve results by making the channel needing inversion
appear to be a near identity system. The acoustic
experiment was repeated with the noisy signal being
recorded from the middle of the room, and the noise
reference being recorded by a microphone placed
directly facing the high energy output section of the
speaker system. Results comparable to the simulated
room experiments were obtained from this experiment (18
dB noise reduction). While this technique may be
effective if a real-time system is available to search
for an optimal position for reference noise collection,
results indicated that it was not simply a matter of
closeness of microphone placement to the noise source
which was going to be a final solution.
Conclusions

During these experiments it became obvious that
many facets of the noise cancellation problem are yet
to be understood. Techniques dealing with filter
inversion are being investigated to better understand
the problems involved. Investigations are also
underway to improve convergence of channel estimates
when frequency bands of low energy are contained in the
reference noise samples. Even if these investigations
are fruitless, however, noise cancellation now appears
to be a worthwhile approach to signal restoration in
acoustically hostile environments.


Estimation  of  the  Parameters 


of  an  Autoregressive  Moving-Average  Process 
In  the  Presence  of  Noise 


William  J.  Done 


The previous report on this project presented the
details for the autoregressive moving-average (ARMA) process
generated by adding white noise to an autoregressive (AR)
process. That report stressed the problems that are
inherent in estimating the parameters of the resulting ARMA
process. Part of this estimation problem lies in the
validity of this model for a given application. A major
part of the difficulty, however, lies in developing
estimation procedures for ARMA processes, regardless of the
source of that process. The primary effort in this project
since the last report has been the investigation of various
methods that might be used to estimate the autoregressive
and moving-average coefficients of an ARMA process from data
generated by that process. Three methods have been
implemented for evaluation.


One procedure mentioned in the previous report is the
Mode 1 iterative method by Steiglitz and McBride [4]. The
approach is: given input and output sequences for an
unknown system, determine the filter which approximates the
unknown system by using a filter which is the ratio of two
polynomials (in the z-domain). Graphically, the
problem is illustrated in Figure 1. The polynomials A(z)
and B(z) are given by

$$A(z) = 1 + \sum_{i=1}^{q} a(i)\, z^{-i}, \qquad B(z) = \sum_{i=0}^{p} b(i)\, z^{-i}.$$

The coefficients a(i) and b(i) in A(z) and B(z),
respectively, are selected to minimize E(z) in some sense.
The model's response, V(z), is

$$V(z) = \frac{B(z)}{A(z)}\, U(z) \tag{1}$$

or

$$A(z)\, V(z) = B(z)\, U(z). \tag{2}$$

Also, from the block diagram, we have

$$E(z) = V(z) - X(z). \tag{3}$$
Steiglitz then performs a "quasi-linearization" on 2), using
previous iterations to form approximations to the
derivatives,

$$A^{i}(z)V^{i}(z) + [A^{i+1}(z) - A^{i}(z)]\,V^{i}(z) + A^{i}(z)\,[V^{i+1}(z) - V^{i}(z)] = B^{i+1}(z)\, U(z) \tag{4}$$

where the superscript indicates the iteration number.
Replacing $V^{i}(z)$ with X(z) in 4) and simplifying gives

$$A^{i}(z)\,V^{i+1}(z) = [A^{i}(z) - A^{i+1}(z)]\, X(z) + B^{i+1}(z)\, U(z). \tag{5}$$

Solving for $V^{i+1}(z)$ and using that expression for V(z) in 3)
gives

$$E^{i+1}(z) = \frac{B^{i+1}(z)}{A^{i}(z)}\, U(z) - \frac{A^{i+1}(z)}{A^{i}(z)}\, X(z). \tag{6}$$

With the input and output sequences recursively filtered through
the i-th iteration of A(z), define $\tilde{U}(z) = U(z)/A^{i}(z)$ and
$\tilde{X}(z) = X(z)/A^{i}(z)$. With these definitions, the time domain
representation for 6) is

$$e(k) = \sum_{i=0}^{p} b(i)\, \tilde{u}(k-i) - \sum_{j=0}^{q} a(j)\, \tilde{x}(k-j) \tag{7}$$


where the iteration number has been dropped for convenience
and a(0) = 1.

The coefficients $\{a(i)\}_{1}^{q}$ and $\{b(i)\}_{0}^{p}$ are selected to
minimize e(k) in the least squares sense. The least squares
procedure requires the solution of the matrix equation

$$R_{ux}\, \delta = r_{ux}$$

where $R_{ux}$ is a matrix composed of the auto- and
cross-correlations of the filtered sequences u(k) and x(k);
$r_{ux}$ is a vector composed of those correlations; and $\delta$ is
the solution vector containing the desired a(i) and b(i) coefficients.
Use of this method thus requires the solution of a set of
p + q + 1 linear simultaneous equations.
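The least squares step of one Mode 1 iteration can be sketched as follows. This is an illustrative sketch only, not the report's implementation: the function names (`all_pole_filter`, `solve_linear`, `mode1_iteration`), the zero initial conditions, and the use of plain Gaussian elimination are all assumptions. It prefilters u(k) and x(k) by 1/A(z) from the previous iteration, builds the normal equations from the error in 7), and solves the resulting p + q + 1 equations.

```python
def all_pole_filter(a, x):
    """Filter x through 1/A(z), with a = [1, a(1), ..., a(q)] and zero initial state."""
    y = []
    for k, xk in enumerate(x):
        acc = xk
        for j in range(1, len(a)):
            if k - j >= 0:
                acc -= a[j] * y[k - j]
        y.append(acc)
    return y


def solve_linear(matrix, rhs):
    """Solve matrix * sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


def mode1_iteration(u, x, a_prev, p):
    """One iteration: returns (a, b) minimizing the sum of e(k)^2 from eq. 7)."""
    q = len(a_prev) - 1
    u_f = all_pole_filter(a_prev, u)   # u(k) filtered by 1/A from the last pass
    x_f = all_pole_filter(a_prev, x)   # x(k) filtered by 1/A from the last pass
    rows = []
    for k in range(len(x)):
        # Regressors for the unknowns [b(0..p), a(1..q)]; the target is x_f(k).
        row = [u_f[k - i] if k >= i else 0.0 for i in range(p + 1)]
        row += [-x_f[k - j] if k >= j else 0.0 for j in range(1, q + 1)]
        rows.append(row)
    n = p + q + 1
    big_r = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    r_vec = [sum(r[i] * x_f[k] for k, r in enumerate(rows)) for i in range(n)]
    theta = solve_linear(big_r, r_vec)
    return [1.0] + theta[p + 1:], theta[:p + 1]
```

Starting from `a_prev = [1.0, 0.0, ...]` makes the first pass an ordinary equation-error fit; later passes reuse the returned `a` as `a_prev`. For noise-free impulse-response data a single pass already recovers the model, which is in line with the impulse-excited results reported in the text.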


For application to the estimation of the coefficients
of an ARMA process, this technique must be modified
slightly. When only the output of the system is known, u(k)
is assumed to be the Kronecker delta function. Also, the
system output x(k) may be modified so that it more closely
resembles an impulse response, as the assumption for u(k)
implies. To test Steiglitz's Mode 1 method, the following
tests have been performed:

1) From a known model, considered to be the system to be
identified, generate an output sequence x(k).

2) The input to the "unknown" system is one of: impulse,
impulse train, or noise (approximately white).

3) Use the Mode 1 method to compute estimates for the
parameters of the "unknown" system.

4) Compare the parameter estimates to the design
parameters.

The results for one 10-pole, 2-zero model system are now
presented. This model is identical to that used by
Steiglitz in [3].

Figure 2 is the model spectrum to be identified. Note
that the zeros are a complex conjugate pair located on the
unit circle. Figure 3 is the output of this model when
excited by an impulse. Figure 3a) is the time sequence and
Fig. 3b) is the spectrum of that sequence in dB. The
estimate for the model spectrum produced by one iteration of
this method is shown in Fig. 4b). The model spectrum is
repeated in Fig. 4a).

The time sequence produced by exciting the model with
an impulse train is shown in Fig. 5a). The period for this
example is 100. Figure 5b) is the estimate of the
spectrum of this process obtained by Hamming windowing the
time sequence and performing a DFT. The result is in dB.
In using the Mode 1 technique for this type of time
sequence, it is desirable to preprocess x(k) to make it more
like an impulse response. Figure 6a) shows the real part of
the complex cepstrum of x(k) after Hamming windowing x(k).


By properly windowing this cepstrum, two useful tasks are
accomplished. First, by eliminating the spikes resulting
from the harmonics in the frequency domain, the periodic
nature of x(k) can be suppressed. The second step is to
force this cepstral representation of x(k) to be causal.
Upon returning to the time domain, if appropriate scaling
has been done in the cepstrum, the resulting time series
will be minimum phase. Figure 6b) shows the cepstrum after
windowing. Figure 7b) contains the new minimum phase time
sequence, while Fig. 7a) contains the output of the impulse
excited model for comparison. Figures 8a) and 8b) are
respectively the spectral estimates of x(k) and $x_{mp}(k)$, the
modified version of x(k). Note in Fig. 7 that $x_{mp}(k)$ is
quite similar to x(k) from the impulse excited case. Figure
8 shows the suppression of the harmonic structure on the
spectrum of x(k) caused by the periodic nature of x(k). The
Mode 1 technique is now applied to $x_{mp}(k)$. Figure 9b) shows
the estimated spectrum for the impulse train excited case
after two iterations. The original model spectrum is
repeated in Fig. 9a).
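The idea of forcing a causal cepstrum to obtain a minimum phase sequence can be illustrated with a short sketch. Note the assumptions: this uses the real cepstrum with the standard folding window rather than the complex cepstrum windowing described in the text, it does not remove pitch spikes, it uses a naive O(N^2) DFT to stay dependency-free, and the names `dft` and `minimum_phase` are hypothetical. It also assumes the spectrum has no exact zeros on the unit circle.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def minimum_phase(x, nfft=64):
    """Return a minimum phase sequence with the same spectral magnitude as x."""
    padded = list(x) + [0.0] * (nfft - len(x))
    log_mag = [math.log(abs(v)) for v in dft(padded)]
    cepstrum = [v.real for v in dft(log_mag, inverse=True)]
    # Fold the even real cepstrum onto non-negative quefrencies (causal cepstrum).
    half = nfft // 2
    folded = [0.0] * nfft
    folded[0] = cepstrum[0]
    for n in range(1, half):
        folded[n] = 2.0 * cepstrum[n]
    folded[half] = cepstrum[half]
    spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real for v in dft(spectrum, inverse=True)]
```

For example, the sequence (0.5, 1.0) has its zero outside the unit circle; the sketch returns approximately (1.0, 0.5, 0, ...), which has the same magnitude spectrum with the zero reflected inside.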

The last case to be considered is when the model has
been excited by a noise sequence. The resulting output
sequence and spectral estimate are shown in Fig. 10a) and
10b), respectively. Superimposed on the spectrum of the
noise excited x(k) is the original model spectrum. Note the
random variations from that ideal spectrum resulting from
the deviation of the excitation sequence from an ideal white
noise process. Figures 11b) and 12b), respectively, show
the spectral estimates produced by this technique after the
first and second iterations. Further iterations fail to
improve upon the estimate, which is poor.

Because of the results for the noise excited case, this
method has been discarded. The estimates obtained for noise
excited processes were consistently poor, often converging
to unstable filter estimates. In addition, the Mode 1
method is strongly dependent upon double precision
arithmetic to achieve success, even in the impulse and
impulse train excited cases.


The parameter estimation method presently being investigated
is that given in Anderson [1]. This method is based on a
time domain Newton-Raphson approach to maximization of the
log likelihood function. Anderson's approach in this method
is to assume zero initial conditions for the data x(k),
k < 0. If $Q(\theta)$ is the log likelihood function to be
maximized, where $\theta$ is the vector of coefficients
$\{a(i)\}_{1}^{q}$ and $\{b(i)\}_{0}^{p}$ to be estimated, then
the maximum satisfies

$$g(\theta) = \frac{\partial Q(\theta)}{\partial \theta} = 0. \tag{9}$$

A first-order expansion of 9) about the current estimate of the
optimal solution, $\theta^{*}$, is performed:

$$g(\theta^{*}) + g'(\theta^{*})\,(\theta - \theta^{*}) = 0. \tag{10}$$

Solving 10) for $\theta$, we have the Newton-Raphson iterative
method for maximizing the log likelihood function.

The Newton-Raphson method has the advantage of the most
rapid convergence when in the neighborhood of the solution
if the log likelihood function is approximately quadratic in
that region. The method does require the evaluation of the
second derivatives of the log likelihood function. Also, if
the initial guess for the parameters is not in the
neighborhood of the maximum, the convergence rate may be
slow or convergence to an incorrect solution may occur. The
Newton-Raphson (N-R) method requires the inversion of the
Hessian of $Q(\theta)$, which is not guaranteed to be positive
semidefinite at $\theta$. As a result, $Q(\theta^{i+1})$ may be less than
$Q(\theta^{i})$, where the superscript indicates the iteration of $\theta$.
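The sensitivity of the Newton-Raphson iteration to the initial guess can be seen even in a one-parameter toy problem. The sketch below is an assumption-laden illustration, not the ARMA likelihood used in the report: it maximizes the log likelihood of an exponential distribution, n log(lam) - lam * sum(x), whose maximum is at lam = n / sum(x). The function name `newton_raphson` and the numbers are hypothetical.

```python
def newton_raphson(grad, hess, theta0, iterations=20):
    """Scalar Newton-Raphson: repeat theta <- theta - g(theta) / g'(theta)."""
    theta = theta0
    for _ in range(iterations):
        theta = theta - grad(theta) / hess(theta)
    return theta


# Toy log likelihood Q(lam) = n*log(lam) - lam*total, maximized at n/total = 0.5.
n, total = 5, 10.0


def grad(lam):
    """First derivative of Q."""
    return n / lam - total


def hess(lam):
    """Second derivative of Q."""
    return -n / lam ** 2


near = newton_raphson(grad, hess, 0.3)                # start near the maximum
far = newton_raphson(grad, hess, 1.2, iterations=3)   # start too far away
```

Starting at 0.3 the iterates converge rapidly to 0.5, while starting at 1.2 the first step already lands on a negative value, outside the region where the likelihood is defined. This is the scalar analogue of the convergence problems described above.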

There are methods which avoid some of these problems. These
will be investigated at a later date.

The N-R estimation technique has been implemented and
is presently being evaluated. Its operation has been tested
on impulse and noise excited models. For the impulse
excited models, the estimate is excellent. In the noise
excited sequences, errors do occur in the parameter
estimates. Those for strictly AR or MA processes seem to be
consistent with the findings of others. The estimates for
ARMA processes tend to exhibit more error. In general, the
N-R method seems to provide better estimates of the
parameters than Steiglitz's Mode 1 method (especially for
the noise excited data), but a high variability in the
accuracy of the estimates occurs from frame to frame.
Double precision arithmetic does not seem to be necessary
for this method.

In an attempt to verify the correct operation of the
N-R method, a third estimation procedure has been
implemented. This is a direct search method based on the
unconditional sum of squares method discussed in Box and
Jenkins [2]. This method is inefficient computationally
because the sum of squares of the estimated excitation
sequence must be computed at a large number of points in the
parameter space. Its usefulness lies in allowing one to
view the shape of the space for low order cases. For the
AR(1) process degraded by white noise, the observed data is
an ARMA(1,1) process. This ARMA(1,1) process is generated
by computing the moving-average coefficient and excitation
variance that would result from adding white noise to the
AR(1) process. The ARMA(1,1) process is then generated
directly, as opposed to generating the ARMA(1,1) process by
creating the AR(1) process and adding the white noise.
Using both the N-R method and direct search method, tests
performed on this first order model have been useful in
predicting how this approach will respond for varying levels
of noise. Further tests on the N-R method and modifications
are now being conducted.
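The computation of the equivalent moving-average coefficient and excitation variance can be sketched as below. The sign convention is an assumption (Box and Jenkins style): the AR(1) process is s(k) = phi s(k-1) + e(k), the observation is x(k) = s(k) + w(k), and the equivalent model is x(k) - phi x(k-1) = eps(k) - theta eps(k-1). Matching the lag-0 and lag-1 autocovariances of the right-hand side gives theta and the variance of eps. The function name `arma11_from_noisy_ar1` is hypothetical.

```python
import math


def arma11_from_noisy_ar1(phi, var_e, var_w):
    """Return (theta, var_eps) of the ARMA(1,1) model equivalent to AR(1) + white noise."""
    # Autocovariances of e(k) + w(k) - phi*w(k-1) at lags 0 and 1.
    gamma0 = var_e + (1.0 + phi * phi) * var_w
    gamma1 = -phi * var_w
    if gamma1 == 0.0:
        return 0.0, gamma0          # no noise or phi == 0: the MA part vanishes
    # For an MA(1) term: gamma0 = (1 + theta^2) var_eps, gamma1 = -theta var_eps.
    rho = -gamma1 / gamma0          # equals theta / (1 + theta^2), |rho| <= 1/2
    theta = (1.0 - math.sqrt(1.0 - 4.0 * rho * rho)) / (2.0 * rho)  # invertible root
    var_eps = -gamma1 / theta
    return theta, var_eps
```

With phi = 0.9 and unit variances for both e(k) and w(k), theta comes out at roughly 0.36; as the noise variance goes to zero theta goes to zero and the model reduces to the original AR(1) process.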
TO ROBUST SPEECH ACTIVITY DETECTION

This report describes a theoretical and experimental investigation
of the detection of speech in wideband noise. An algorithm
for making the silence-speech decision is described. This algorithm
is based on a nonparametric statistical signal-detection scheme that
does not require a training set of data and maintains a constant false
alarm rate for a broad class of noise inputs. The nonparametric
decision procedure is the multiple use of the two-sample Savage T
statistic. The performance of this detector is evaluated and compared
to that obtained from manually classifying seven recorded utterances
with 40, 30, 20, 10, and 0 dB signal-to-noise ratios. In limited testing,
the average probability of misclassification is less than 6%, 12% and
46% for signal-to-noise ratios of 39, 20, and 0 dB respectively.


Introduction

The problem of detecting voice signals in the presence of noise
has been addressed by only a small number of investigations. In these
investigations, the traditional approach to distinguishing between
voice and noise, or to estimating the bandwidth occupied by the speech
signal, was to level detect waveform power (1, 2, 3, 4, 10, 24). The
threshold normally was determined experimentally from a limited training
set of data (1, 4, 5, 6), from the maximum line noise power recommended by
CCITT for telephone channels (1, 2, 3, 4, 7, 8, 13), or by a threshold
adjustment process updated on a fixed schedule (every half second) (25).

Recently, Atal and Rabiner (21) suggested a pattern recognition
approach to voiced-unvoiced-silence classification in which five
measurements or features (energy, zero-crossing rate, autocorrelation
coefficient at unit sample delay, first predictor coefficient, and energy
of the prediction error) were combined using a non-Euclidean distance
metric to give a reliable decision. This method was optimized for
telephone line inputs by Rabiner, et al. (22) and used for digit
recognition by Rabiner, et al. (18, 19). The algorithm was modified to do
an average signal spectrum template match using the LPC distance
measure (23).

L. S. Siegel (67) proposed a modification to the Atal (21)
algorithm in which a relatively small set of samples is used to "train"
the classifier.


Lin (71), Adoul (28), and Adoul (29) modified Atal and Rabiner's
pattern recognition approach for their proposed detectors.

Reliable discrimination between silence, unvoiced speech, and
voiced speech is a difficult problem because no general theory exists
which can preselect the optimal features for input to the classifier.
Furthermore, robustness is not achieved because parametric statistics
that rely on a normality assumption were used in almost all the past
investigations. The past algorithms also required a training set of data
to determine the required detector threshold levels.

In this report, a nonparametric statistical detection technique
is used to classify a given interval of speech data as silence or speech,
and results are presented for a limited experiment of seven utterances at
various signal-to-noise ratios. The advantages of this technique are
that the proposed detection algorithm:

• maintains a constant false-alarm rate (CFAR) at the detector
output for large classes of noise distributions

• is robust (insensitive to changes not under test) and powerful
(sensitive to specific factors under test)

• does not require statistical information about either the signal
or the background noise (does not require a training set of
data)

• may often surpass, for signals in non-Gaussian noise, the
performance of detectors optimized against Gaussian noise

• will operate where the noise statistics are non-stationary or
change from one application to another

• is simple to implement digitally

• is, for large n, as efficient as the Neyman-Pearson detector
for a wide class of noise distributions

Nonparametric decision procedures have previously been applied to
radar and sonar systems that have to operate in an environment of heavy
external interference (48). The major reason behind the use of this
type of detector is its ability to maintain a constant false-alarm rate
(CFAR) for large classes of noise distributions (noise, weather, clutter,
interference). These detectors can be designed to operate in an
environment where very little statistical information about either the
signal or the background noise is available. In addition, the detection
performance of such detectors for signals in non-Gaussian noise may
often surpass that of detectors optimized against Gaussian noise (48,
60, 63).

This study has been primarily concerned with detectors based on
rank order statistics. A ranking of data samples in a set of
observations X is normally specified by the vector of ranks
$R = (R_1, \ldots, R_N)$, where $R_i$ is the rank of the observation
$X_i$ in the sample set. For example, with a sample set
X = (15, 256, 9, 4) we have R = (3, 4, 2, 1).

In a general rank test, the test statistic T is a function of the
vector of ranks R,

$$T(R) = \sum_{i=1}^{N} c_i\, g(R_i),$$

where g(·) is a function of the ranks, and the $c_i$ are coefficients
that must be determined.
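The ranking example can be reproduced with a few lines of code. This is a minimal sketch with hypothetical names (`ranks`, `linear_rank_statistic`); it assumes no tied observations, and it assumes the general rank statistic takes the usual weighted-score form, the sum over i of c_i g(R_i).

```python
def ranks(samples):
    """Rank of each observation within the sample set (1 = smallest); no ties assumed."""
    order = sorted(range(len(samples)), key=lambda i: samples[i])
    result = [0] * len(samples)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def linear_rank_statistic(samples, coeffs, score=lambda rank: rank):
    """T = sum_i c_i * g(R_i), with g defaulting to the identity."""
    return sum(c * score(r) for c, r in zip(coeffs, ranks(samples)))


example_ranks = ranks([15, 256, 9, 4])
```

Choosing the coefficients as 1 for the observations under test and 0 for the reference observations turns this into a two-sample rank-sum statistic; other choices of g give other rank tests.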


The data, expressed in the form of the scalar test statistic T, are then
divided into acceptance and rejection regions for the null hypothesis
$H_0$ (only noise present in the data sample) as follows:

$$H_0:\; T(R) < K$$
$$H_1:\; T(R) \ge K$$

where K is a constant threshold. This study recommends the
form of the nonparametric statistic and the decision procedure.

The theory of nonparametric statistics suggests that a robust
detector can be obtained by formulating the speech and background
noise signals in terms of a nonparametric rank test. The theoretical
investigations and the limited testing suggest that this nonparametric
decision approach to voiced-unvoiced-silence classification of
speech should be considered for other speech processing applications
where robustness must be addressed.


System Description

Introduction

This report describes the signal classification decision procedure,
along with the theoretical justification of the nonparametric
distribution model.

The system operates in the following manner:

The speech signal is low-pass filtered to 3.2 kHz (telephone
bandwidth), sampled at a 6.67 kHz rate, and high-pass filtered at
approximately 200 Hz to remove any dc or low frequency hum. The output
from the high-pass filter is formatted into blocks of 100 samples (15 msec
of speech data). Each block of speech is then applied to four digital
filters. The output of each filter and the unfiltered speech time series
are rank ordered. The rank order values are then passed to the detector
or classifier algorithm. Figure 1 shows a block diagram of the detection
algorithm.
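The block formatting step can be sketched as follows. The sample rate and block size are taken from the text; the function name `split_into_blocks` and the choice to discard a trailing partial block are assumptions made only for illustration.

```python
SAMPLE_RATE_HZ = 6670.0   # 6.67 kHz sampling rate from the text
BLOCK_SIZE = 100          # samples per block from the text


def split_into_blocks(samples, block_size=BLOCK_SIZE):
    """Split a sample sequence into consecutive full blocks; a partial tail is dropped."""
    usable = len(samples) - len(samples) % block_size
    return [samples[i:i + block_size] for i in range(0, usable, block_size)]


# Duration of one block in milliseconds: 100 / 6670 s, about 15 msec.
BLOCK_DURATION_MS = 1000.0 * BLOCK_SIZE / SAMPLE_RATE_HZ
```

Each block returned here would then be passed through the four band filters and rank ordered before reaching the detector.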

Filter Selection Criteria

The digital filter bank satisfies two functional requirements for
the system. First, it provides the reference noise samples for the
detector. The rationale for this function will be explained in the
detector description. Secondly, it filters the speech waveform to
estimate the instantaneous or effective bandwidth of the speech signal.

The filter design method is based on the work of Schafer (34) and
Crochiere (35).

The important property achieved by this filter bank is that the
sum of the individual frequency responses of the bandpass filters
(composite response) is flat with linear phase. The band-partitioning is
such that each sub-band contributes equally to the Articulation Index (AI).
The Articulation Index indicates, on the average, the contribution of
each part of the spectrum to the overall perception of the spoken sound.

The 200 to 3200 Hz frequency range is partitioned into four
equal-contribution bands, each sub-band contributing 20 percent to the AI.
This partitioning corresponds to a word intelligibility of
approximately 93 percent (33, 34).

Figure 2 shows the partitioning of the speech spectrum into four
contiguous bands.

These filters were designed using McClellan, Parks and Rabiner's

Figure 2  PARTITIONING OF THE SPEECH SPECTRUM INTO
FOUR CONTIGUOUS BANDS THAT CONTRIBUTE
EQUALLY TO ARTICULATION INDEX. THE
FREQUENCY RANGE IS 200 TO 3200 Hz.

Nonparametric Detector Design

Introduction

This section presents an analysis of the nonparametric decision
procedures considered for experimental verification.

The purpose of the nonparametric decision procedure design is
two-fold:

1. Develop and test a nonparametric detector that will reliably
estimate the bandwidth occupied by the speech signal, and

2. Develop and test a nonparametric detector that will classify
a given set of speech data as voiced speech, unvoiced speech
or silence.

A distribution model is presented in order that the form of the
distributions involved can be used to obtain a suitable decision
procedure and test statistic. Using this model as a starting point, methods
for estimating the noise are examined.

Data Model

To capitalize on the advantages offered by nonparametric hypothesis
testing, it is sufficient to investigate the form of the expected
sample distribution. A distribution model for the speech amplitude
values is required in order to be able to select a suitable test statistic.

Commeski, Paez, and Glisson (34, 36, 37, 38) and others have proposed
a special form of the gamma density as the amplitude probability
distribution of speech,

$$p_{\gamma}(x) = \left[\frac{k}{4\pi |x|}\right]^{1/2} e^{-k|x|},$$

where

$$\text{RMS value} = \sigma = \frac{\sqrt{3}}{2k}.$$

A simpler approximation is the double exponential or Laplacian
density

$$p_{\alpha}(x) = \frac{\alpha}{2}\, e^{-\alpha |x|},$$

where

$$\text{RMS value} = \sigma = \frac{\sqrt{2}}{\alpha}.$$

In Figure 3, the gamma and Laplacian densities are compared with
the experimentally determined density for real speech (from Paez and
Glisson) (36).

The amplitude distribution is interpreted as the sum of two
distributions: one distribution, with a very high peak at zero amplitude,
corresponds to unvoiced sounds (e.g., fricatives) and system noise,
and another, that of large amplitude values, corresponds to voiced
sounds (e.g., vowels, semivowels, etc.).

The small values of amplitude (unvoiced and noise) can be
approximated by a normal distribution, and an exponential distribution can
approximate the voiced sounds.

i 

An alternative model can be derived by formulating the detector
assuming that the signal is applied to a square-law detector prior to
being processed. The model to be developed is used to determine the
signal power in each of the filter outputs (bandwidth $B_r$ Hz). Thus,
the presence of a speech signal is indicated by an increase in the
average energy of the processed waveform. The signal is passed through
a narrow-band bandpass filter centered at $f_k$ Hz and having a bandwidth
of $B_r$ Hz. This filter output is then applied to a detector which
consists of an absolute value process to approximate a squarer and an
averager. The detector output yields an estimate of the signal power
in this bandwidth.

When random Gaussian noise is applied to the input to the bandpass
filter, it can be shown that the output of the detector has a
chi-square distribution with 2TW degrees of freedom (2, 41).

Figure 3  REAL SPEECH AND THEORETICAL GAMMA AND
LAPLACE PROBABILITY DENSITIES.

If speech is modeled as a sinusoidal signal, the density of the detector
output is a noncentral gamma density that can be approximated by a
chi-square distribution, and if it is modeled as flat narrow-band Gaussian
noise (unvoiced), the density is exponential (56). The aforementioned
distributions are part of the gamma distribution family, and differ only
in the shape parameter. The general density function for the gamma
distribution is given as

$$f(y) = \frac{y^{r-1}\, e^{-y/\alpha}}{\alpha^{r}\, \Gamma(r)}, \qquad y \ge 0.$$

The chi-square distribution with $\nu$ degrees of freedom is exactly
the gamma distribution with $\alpha = 2$ and $r = \nu/2$.

Another special case of the gamma distribution results when r = 1;
this is the exponential distribution.
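The two special cases can be checked numerically. The sketch below assumes the gamma density is parameterized by a shape r and a scale alpha, f(y) = y^(r-1) exp(-y/alpha) / (alpha^r Gamma(r)); the function names are hypothetical, and the chi-square density is written out independently from its textbook form so that the comparison is meaningful. Evaluation is restricted to y > 0.

```python
import math


def gamma_pdf(y, r, alpha):
    """Gamma density with shape r and scale alpha, for y > 0."""
    return y ** (r - 1.0) * math.exp(-y / alpha) / (alpha ** r * math.gamma(r))


def chi_square_pdf(y, nu):
    """Chi-square density with nu degrees of freedom, for y > 0."""
    half = nu / 2.0
    return y ** (half - 1.0) * math.exp(-y / 2.0) / (2.0 ** half * math.gamma(half))


def exponential_pdf(y, mean):
    """Exponential density with the given mean, for y > 0."""
    return math.exp(-y / mean) / mean


# Chi-square with nu = 4 versus gamma with alpha = 2, r = nu / 2 = 2.
chi_gap = abs(chi_square_pdf(3.0, 4) - gamma_pdf(3.0, 2.0, 2.0))
# Exponential versus gamma with r = 1.
exp_gap = abs(exponential_pdf(1.0, 2.0) - gamma_pdf(1.0, 1.0, 2.0))
```

Both gaps are zero to within floating point rounding, which is the sense in which the detector-output densities differ only in the shape parameter.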

Noise Estimation Procedures

The basic problem of detecting the bandwidth occupied by the speech
signal can be formulated as testing the simple hypothesis H versus the
composite alternative K, where H and K are given by

$$H:\; x_i = v_i$$
$$K:\; x_i = s_i + v_i$$

and $x_i$ is the input sample, $v_i$ is zero mean Gaussian noise with unknown
variance, and $s_i$ is the speech signal to be detected.

Based on the input samples, denoted by $x = \{x_i,\; i = 1, \ldots, n\}$,
it is desired to develop a decision rule for determining which of the
hypotheses "best" characterizes the sampled data.

Classical parametric detectors assume that f(x|H) and f(x|K) are
known or that estimates of the density parameters can be obtained from
a training set of data.


For this investigation the assumption will be made that the noise
is non-stationary. This is equivalent to assuming the noise level is
unknown ahead of time by the detector, so the classical approach does
not apply.

Provisions for obtaining a set of reference noise samples must
be incorporated into the detector design. The following method was
investigated for obtaining a reference set of noise samples.


The noise samples will be derived from the output of the filter
bank. It is assumed that the noise process is wideband; therefore,
the spectrum is approximately flat across the entire band 200 - 3200 Hz.
The entire frequency spectrum of interest is partitioned into four
contiguous bands (see Figure 2). Each subband is assumed to be independent.

It is assumed that the sampled data $x_i^j$ (j = 1, ..., 4; i = 1, ..., 100)
are independent for all i, j. All elements of
$x^j = (x_1^j, x_2^j, \ldots, x_n^j)$ are identically distributed with cdf
F(x) if there is only noise in filter j, or with $F_{0j}(x)$ if there is
speech in filter j. The correlation between filters is a function of the
time bandwidth product B, and for bands separated by an adjacent slot
(j + 2) all correlations are small.

Under the null hypothesis of only noise in the filter outputs, all
$x_i^j$ are distributed as F(x). The presence of speech in any of the
filters induces a Laplacian or Gaussian density which at small
signal-to-noise ratio can be characterized as a scale alternative.

This result leads quite naturally to the use of two-sample
statistics. The input sample data in the filter being tested form the
first sample, and the pooled data from the remaining filters form the
reference sample or second sample. The decision procedure proposed
is a multiple decision procedure: it simultaneously tests H (noise only)
and, upon rejection, indicates which filter output contains speech data.

The decision procedure proposed forms a test statistic for each
filter output and compares the resulting value of the statistic with
a threshold to determine which filter contains speech.

Decision Procedure (Simultaneous Test)

Under the null hypothesis, $\{x_i^j\}$ is distributed as F(x), i = 1, ..., n,
j = 1, 2, 3, 4. Each filter output is tested for speech energy.

The two sample processing procedure is:

$$y^{j} = (x^{1}, \ldots, x^{j-1}, x^{j+1}, \ldots, x^{4}),$$

where $x^j$ are the data to be tested and $y^j$ are the remaining pooled data.

Let $S_j = D(x^j, y^j)$, where D(·,·) is a two-sample statistic. The
data are reduced to four statistics $(s_1, s_2, s_3, s_4)$, which are
obviously dependent. The decision procedure is to declare there is
speech in filter j if

$$S_j > \lambda_K,$$

where $\lambda_K$ is the threshold determined by the false-alarm probability
as specified.
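The simultaneous test can be sketched in code. Everything named here is an assumption for illustration: the function names, the use of a plain rank sum as the two-sample statistic D (the report's choice is the Savage statistic), the absence of tied values, and the convention that speech is declared when the statistic exceeds the threshold. The inputs are taken to be magnitudes of the filter outputs.

```python
def rank_sum(tested, reference):
    """Sum of the ranks of `tested` within the pooled sample (no ties assumed)."""
    pooled = sorted(tested + reference)
    return sum(pooled.index(v) + 1 for v in tested)


def detect_speech_bands(filter_outputs, threshold, statistic=rank_sum):
    """For each filter j, test its block against the pooled blocks of the other filters."""
    decisions = []
    for j, tested in enumerate(filter_outputs):
        reference = [v for k, block in enumerate(filter_outputs) if k != j for v in block]
        s_j = statistic(tested, reference)
        decisions.append(s_j > threshold)
    return decisions


# Four bands of three magnitudes each; only the third band has large values.
bands = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [5.0, 6.0, 7.0], [0.7, 0.8, 0.9]]
flags = detect_speech_bands(bands, threshold=30)
```

In this toy input the rank sums are 6, 15, 33, and 24, so only the third band is flagged. Because every band serves as reference data for the others, the four statistics are dependent, as the text notes.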


The Kruskal-Wallis One-Way ANOVA Test

The experimental situation is one where K random samples have
been obtained, one from each of K possibly different populations, and we
want to test the null hypothesis that all of the populations are identical.

    Sample 1      Sample 2      ...      Sample K

    X_11          X_21          ...      X_K1
    X_12          X_22          ...      X_K2
    ...           ...           ...      ...

With $R_i$ denoting the sum of the ranks assigned to sample i,
the central limit theorem may be used:

$$\frac{R_i - E(R_i)}{\sqrt{\operatorname{Var}(R_i)}} \sim N(0, 1) \quad \text{when } H_0 \text{ is true,}$$

and

$$\left[\frac{R_i - E(R_i)}{\sqrt{\operatorname{Var}(R_i)}}\right]^{2} \sim \text{chi-square with one degree of freedom.}$$

If the $R_i$ were independent of each other, the distribution of
the sum

$$T = \sum_{i=1}^{K} \frac{[R_i - E(R_i)]^{2}}{\operatorname{Var}(R_i)}$$

would be chi-square with K degrees of freedom. However, the sum of the
$R_i$ is N(N+1)/2, where $N = \sum_{i=1}^{K} n_i$ is the total number of
observations, so there is a dependence among the $R_i$. If the i-th term
is multiplied by $(N - n_i)/N$ for i = 1, 2, 3, ..., K, then the result
is asymptotically distributed as a chi-square with K-1 degrees of freedom.

Under $H_0$ the Savage statistic satisfies

The decision rule is to declare

$$H_0:\; \text{if } T \le K$$
$$H_1:\; \text{if } T > K$$

The filter with the largest rank is declared as the best estimate
of the bandwidth of the speech signal. A detailed description of this
nonparametric test is contained in References (42, 44, 45, 49).
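The Kruskal-Wallis statistic can be computed two equivalent ways, which makes the role of the correction factor concrete. The sketch assumes no tied observations and uses the standard rank-sum moments under the null hypothesis, E(R_i) = n_i (N+1)/2 and Var(R_i) = n_i (N+1)(N - n_i)/12, with N the total number of observations; these moments and the function names are not taken from the report.

```python
def rank_sums(samples):
    """Rank sum of each sample within the pooled data (no ties assumed)."""
    pooled = sorted(v for s in samples for v in s)
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}
    return [sum(rank_of[v] for v in s) for s in samples], len(pooled)


def kruskal_wallis(samples):
    """Usual computing form: 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)."""
    sums, n_total = rank_sums(samples)
    total = sum(r * r / len(s) for r, s in zip(sums, samples))
    return 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)


def kruskal_wallis_weighted(samples):
    """Same statistic as a sum of squared standardized rank sums times (N - n_i)/N."""
    sums, n_total = rank_sums(samples)
    stat = 0.0
    for r, s in zip(sums, samples):
        n_i = len(s)
        mean = n_i * (n_total + 1) / 2.0
        var = n_i * (n_total + 1) * (n_total - n_i) / 12.0
        stat += (n_total - n_i) / n_total * (r - mean) ** 2 / var
    return stat


h_value = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For the three fully separated samples above the rank sums are 6, 15, and 24 and both forms give 7.2, which would be compared with a chi-square threshold with K - 1 = 2 degrees of freedom.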

Choice of a Two-Sample Statistic

The form of the distributions developed in the data model section
of this proposal identifies the following two-sample statistics that
will be incorporated in the speech detector.

Savage Statistic

The Savage statistic was studied for this application because it
is the optimum rank statistic for an exponential distribution and a
scale alternative (43, 52).

Assume we want to test whether two samples differ in scale
(dispersion). The procedure for the two sample problem is to combine both
samples into a single ordered sample and then assign ranks to the sample
values from the smallest to the largest value, without regard to which
population each came from. The test statistic is the sum of ranks assigned
to the values from one of the populations. If the sum (test statistic)
is too small, or too large, there is some indication that the values
from that population tend to be smaller, or larger, than the values
of the other population. The null hypothesis of no difference between
populations may be rejected if the ranks associated with one sample
tend to be larger than those of the other sample.


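The Savage statistic has the form

$$S = \sum_{i=1}^{N} a_i Z_i$$

where

$$Z_i = \begin{cases} 1 & \text{if } X_i \text{ belongs to } X_1, \ldots, X_m \\ 0 & \text{if } X_i \text{ belongs to } X_{m+1}, \ldots, X_{m+n} \end{cases}$$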
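
As a hedged illustration of a rank statistic of this form, the sketch below pools two samples, orders them, and sums the scores attached to the positions occupied by the first sample. The report does not give the score values in this passage; the code assumes the usual Savage (exponential) scores, a_i = 1/N + 1/(N-1) + ... + 1/(N-i+1), and no ties. The function names are hypothetical.

```python
from typing import List, Sequence


def savage_scores(n_total: int) -> List[float]:
    """Assumed Savage scores: a_i = sum_{j=N-i+1}^{N} 1/j for rank i = 1..N."""
    return [
        sum(1.0 / j for j in range(n_total - rank + 1, n_total + 1))
        for rank in range(1, n_total + 1)
    ]


def savage_statistic(first: Sequence[float], second: Sequence[float]) -> float:
    """S = sum_i a_i * Z_i, where Z_i = 1 if the i-th smallest pooled value is from `first`."""
    pooled = [(value, 1) for value in first] + [(value, 0) for value in second]
    pooled.sort(key=lambda pair: pair[0])
    scores = savage_scores(len(pooled))
    return sum(a_i * z_i for a_i, (_, z_i) in zip(scores, pooled))


if __name__ == "__main__":
    # The first sample holds the largest pooled value, so it collects the largest score.
    print(savage_statistic([3.0], [1.0, 2.0]))
```

Large values of S indicate that the first sample tends to occupy the upper ranks, which is how a scale increase shows up for magnitude-like data.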


Simplified Procedures


The rank-sum test is quite complex; the procedure requires that all
data in the filter be stored and also requires time-consuming processing,
because all rank values have to be re-adjusted with each new filter
output.

This can be greatly simplified using a mixed statistical test.

Feustal (50) shows that the Mann-Whitney statistic presented has high
efficiency but requires O(N^2) operations to perform the ranking
operation. Feustal proposed a mixed statistical test that requires O(mn)
operations. The mixed statistic operates as follows:

The N observations from each of the K samples are divided into p groups
of m observations. An intermediate statistic on each of the observations
for x samples reduces the KN observations to p values. The p values
are summed to form a test statistic to compare with the threshold.

The paper shows that for m ≥ 15 negligible loss in efficiency is
experienced. Woinsky extended these results to the two-sample case
and through simulation confirmed Feustal's theoretical results.

This variation of the rank sum test will be incorporated into the
experimental verification test to simplify the computational requirements.

Test Procedure


The nonparametric detector was tested on nine types of signals:

1) noise output from an analog noise generator

2) background noise from the Rhyme file

3) the following words from the Rhyme file: Gob, Sue, Taunt, Nil,
Boast, Jab, and Cheat

The Diagnostic Rhyme tape was supplied by Dyna Stat Inc. (72).

The additive white noise tape was generated by digitizing the
output of an analog noise generator. Both the word file and the noise
file are prefiltered with a low-pass filter having a 3.2 kHz cutoff
frequency and are sampled at 6667 Hz.

The  program  ARPGEN'SAV  (70)  is  used  to  calculate  the  SNR  and  the 
Constant  C is  computed  by  SPLUSN'SAV  (70)  so  that  specified  signal-to- 
noise  ratio  test  words  can  be  created. 

Using the above software programs and data files, various words
with additive white noise of progressively smaller signal-to-noise
ratios (40, 30, 20, 10, and 0 dB) were created and processed by the
detector algorithms.


Preliminary  Results 


Evaluation  Tests 


Six preliminary speech tests were conducted to evaluate the speech
detector's performance for five different S/N ratios: 0, 10, 20, 30,
and 40 dB of wideband Gaussian noise. For each clean test word from
the Rhyme file, a manual analysis was performed on each 15 msec
interval to classify it as voiced, unvoiced, or silence, based on visual
inspection of the acoustic waveform and a phonetic interpretation of the
utterance. Two independent manual classifications were made on each
test word.

Gaussian noise was digitally added (see Test Procedure) to the
clean test words to produce a controlled data base having specified
signal-to-noise ratios.

Error rates were computed by comparing the manual classification
with the detector's classification output.

Experiment  No.  1 

The first test measured the correctness of assuming the zero-mean
Gaussian and Laplacian amplitude distribution models for noise and
speech, the assumption that speech manifests itself as a scale
alternative, and that E(Ri) = 20.0.

The Mann-Whitney test (48) was performed on the noise file and
all three original word files (40 dB). The test results were not
significant; the null hypothesis that the mean value is essentially zero
could not be rejected at the 95% confidence level. When the Savage T
test was used, the assumption of a scale alternative was significant,
and those results are reported in the remainder of this section.

The mean value of the rank order statistic for the Savage T test measured
M(Ri) = 19.97, Sx = 5.97, Sx = .56 for 300 blocks of 15 msec data.

This compares to a normal approximation expected value of

M(Ri) = 20.00
VAR = 3.77

The above results did not experimentally refute the amplitude
distribution model assumption, nor did they show that the normal
approximation for the test statistic was not valid.

Experiment  No.  2 

Two preliminary speech tests were conducted to evaluate the NP
detector. The first test measured the accuracy of the classification
algorithm when the Savage T test was used to compare the amplitude
distributions of speech and noise. The test method rank ordered 100
samples from each of the four filter outputs and formed the pooled
sample to estimate the noise. Two hypothesis testing procedures were
employed. One approach is to test H0 by the Kruskal-Wallis test
statistic and, upon rejection, use a ranking and selection procedure to
locate the frequency location of the speech sample. The other approach
employs a procedure that simultaneously tests H0 and upon rejection
indicates what frequency component is present. This procedure is known
in the statistical literature under the headings of multiple-decision
procedures, multiple comparison procedures, or simultaneous statistical
inference.


The second test used a mixed statistical test in which the absolute
values of 5 samples were averaged, and 20 blocks of these averaged values
were then pooled and tested as above. This method cut the ranking
requirement from 100 to 20. The efficiency of this method was then
compared to the classification accuracy of the full ranking algorithm.
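
The block-averaging step described above can be sketched as follows. This is only an illustration of reducing 100 samples to 20 values before ranking; the function name `block_mean_abs` and the synthetic input are hypothetical, and the subsequent ranking and thresholding are not shown.

```python
from typing import List, Sequence


def block_mean_abs(samples: Sequence[float], block_size: int = 5) -> List[float]:
    """Average the absolute values of consecutive, non-overlapping blocks."""
    if block_size <= 0 or len(samples) % block_size != 0:
        raise ValueError("sample count must be a positive multiple of block_size")
    return [
        sum(abs(x) for x in samples[start:start + block_size]) / block_size
        for start in range(0, len(samples), block_size)
    ]


if __name__ == "__main__":
    # 100 synthetic filter-output samples become 20 values to be ranked.
    filter_output = [((-1) ** n) * (n % 7) for n in range(100)]
    reduced = block_mean_abs(filter_output)
    print(len(reduced))
```

Only the 20 reduced values per filter need to be stored and ranked, which is the source of the computational saving.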

The full ranking algorithm tests the amplitude distribution, whereas
the mixed statistical test compares the energy distributions.

For the 100 sample test, the threshold value of the classifier
was set to the Chi-square approximation of 9.48.


The simultaneous test threshold results from a significance level
of .05 corrected for paired comparisons by

$$\alpha' = \frac{2\alpha}{K(K-1)}$$

where α' = .0083 and Z = 2.39.
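
The quoted values .0083 and 2.39 are consistent with dividing the .05 level among the K(K-1)/2 pairwise comparisons of K = 4 filter outputs and then taking a one-sided normal critical value. That reading is an assumption made here for illustration; the short check below uses only the Python standard library, and the function names are hypothetical.

```python
from statistics import NormalDist


def pairwise_corrected_alpha(alpha: float, k: int) -> float:
    """Split the overall level alpha across the k(k-1)/2 pairwise comparisons."""
    return alpha / (k * (k - 1) / 2)


def one_sided_normal_threshold(alpha: float) -> float:
    """Standard-normal value exceeded with probability alpha."""
    return NormalDist().inv_cdf(1.0 - alpha)


if __name__ == "__main__":
    alpha_prime = pairwise_corrected_alpha(0.05, 4)   # four filter outputs
    z = one_sided_normal_threshold(alpha_prime)
    print(f"alpha' = {alpha_prime:.4f}, Z = {z:.2f}")  # alpha' = 0.0083, Z = 2.39
```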


For the 20 sample test, the Chi-square threshold was set to 18.1.
The simultaneous test threshold was set to 3.30.


Summary of Test Results


Tests  of  noise  classification  were  performed  on  the  analog  noise 
file  and  the  background  noise  in  the  Rhyme  file. 

The  % correct  recognition  for  the  K-W  test  (20  samples)  was  96%; 
the  K-W  test  (100  samples)  was  96.7%  on  the  analog  noise  file. 

The  % correct  recognition  of  the  simultaneous  test  (20  samples) 
and  (100  samples)  was  86%  and  92%. 

The  % correct  recognition  of  the  background  noise  of  the  Rhyme 
file  was: 

92%  - K-W  test  (20  samples) 

68%  - Simultaneous  test  (20  samples) 

93%  - K-W  test  (100  samples) 

85%  - Simultaneous  test  (100  samples) 

The % recognition test on the seven words from the Rhyme file was
performed using the simplified procedure (sum 5 samples, rank the 20
sums).

Table 1 summarizes the overall recognition rate as a function of
S/N ratio of the simultaneous decision procedure for all the test
utterances.

SNR (dB)     39    30    20    10     0

Silence      58    88    96    94    93
Voiced       94    95    88    74    53
Unvoiced     95    96    84    59    35
Total %      82    93    89    --    60

TABLE 1

Recognition Rate for the Simultaneous
Decision Procedure for all Seven Words

(20 samples)

Silence 85.8% recognition overall
Voiced 80.8% recognition overall
Unvoiced 73.8% recognition overall


Table 2 summarizes the recognition rate for each word as a
function of S/N ratio for the K-W decision procedure.


Recognition  Silence  Voiced  Unvoiced 


Silence  Overall  Recognition  Rate  = 96% 
Voiced  Overall  Recognition  Rate  = 77% 
Unvoiced  Overall  Recognition  Rate  = 56% 


A nonparametric statistical detector for recognizing speech has
been described and implemented. Preliminary results of limited testing
show that the detector performs as well as the pattern recognition
approach reported in the literature (18, 19, 20, 21, 22, 23). In
limited testing, the classifier performed with a misclassification
rate of less than 5% with thresholds calculated from theory. The
desirable features of this detection or classification scheme are
that it does not require a training set of data or a priori information
on the statistical parameters of speech or noise.

The manual analysis of the speech waveform contained segments
in which uncertain intervals occurred. These uncertain intervals were
mostly at stop gaps in words such as "boast" and "taunt". In the
interval corresponding to the stop gap before the unvoiced T, the
amplitude and frequency characteristics resembled noise but did not
contain the wideband noise characteristic.

If the segments were suppressed and the word was acoustically
transcribed, no loss of intelligibility occurred. For this reason,
the segments were classified as noise.

This classification technique resulted in an increase in
recognition rate between the 39 dB and 30 dB signal-to-noise tests. In
effect, the addition of noise pre-whitened the low frequency component
of the recording noise and aided the classification algorithm. This
uncertainty also occurred in the word "cheat". In this word, eight
transition blocks occurred during the voicing interval. These
transitions were classified as errors in the algorithm's detection of
voicing.


The classification of the unvoiced "T" in the words "boast" and
"taunt" was correct until the added noise obscured the "T" sound.
This occurred at 10 dB for the word "taunt" and 20 dB for the word
"boast". The initial "S" in "sue" is covered by noise at 20 dB. The
classification algorithm's recognition rate is calculated by using the
original 39 dB clean speech utterance as the reference. If a manual
reclassification was performed at each level of signal-to-noise
ratio, the resulting algorithm classification was within one or two
percent of the recognition rate for the original clean speech
classification. The 20 sample test compared favorably with the 100
sample test when the threshold was adjusted to compensate for the change
in the number of degrees of freedom and for the loss of efficiency caused
by using the mixed statistical decision procedure. The 20 sample test
will be used for the remaining analysis in this research. The
simultaneous test procedure will be used instead of the K-W decision
procedure because it is more sensitive to the detection of the unvoiced
interval of speech. The loss of efficiency for recognizing the silence
interval will be investigated and an alternate decision algorithm
implemented to correct this deficiency.


Future Research


The results of the tests indicate that significant increases in
false alarm rates are experienced with the mixed statistic simplified
algorithm. The reasons for the negative results are at present not
completely understood. Loss of efficiency is predicted from theory,
but the correction to the threshold setting to offset this disadvantage
is not readily apparent. A more detailed comparison of the full rank
algorithm (100 samples) and the mixed statistic algorithm will be
studied. Alternate decision algorithms proposed in the theoretical
description will also be tested.

The conditional rank test approach proposed by Kassam (68) was
tried on a limited set of data, and it appears that this technique can
improve the silence false alarm rate. More extensive testing of this
approach will be undertaken. A modified level test will also be tested,
in which the two-sample test updates a fixed threshold that regulates
the false alarm probability and a second, adaptive threshold that
estimates the standard deviation of the pooled-sample noise estimate to
compensate for the loss of efficiency of the unvoiced decision at low S/N.
A preliminary test of this concept resulted in 100% recognition of the
unvoiced segment of the word "Taunt" down to an S/N of 0 dB.

A study of the effects of incorporating these techniques into
the proposed detection algorithm will be reported in the next
semi-annual report.


1. Introduction


In signal processing the use of the Fourier
transform has enjoyed singular success, not only as a
practical tool of unmatched power and rich application
but also as a theoretical viewpoint imparting
simplifying insight. Unfortunately a problem arises in
that the Fourier integral transform

$$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt$$

is impossible to calculate on a computer; thus we need
to construct approximations in order to make
calculation of the Fourier transform effective. First
we usually approximate the integral by a sum, i.e.
time-sample the input signal:

$$\hat{f}(\omega) \approx \sum_{n=-\infty}^{\infty} f(nT)\, e^{-i\omega nT}$$

Even with this simplification an infinite sum remains;
therefore in order to make the calculation in a
reasonable amount of time we must truncate to a finite
number of terms:

$$\hat{f}(\omega) \approx \sum_{n=-N/2}^{N/2} f(nT)\, e^{-i\omega nT}$$


This second approximation is equivalent to truncating
the input signal to a finite record length. We are now
confronted with a problem: how do we choose our
approximations wisely in order to maintain a reasonable
accuracy in our calculations? For time-sampling, the
familiar Shannon sampling theorem provides a guide so
that we can make wise decisions concerning this
approximation. The second approximation, truncating
the input signal or "windowing," has been handled by
engineers on an ad hoc basis. There has been no analog
of the Shannon sampling theorem to help us decide how
long to make our window for any given application.
Indeed, in most applications it is apparent that a
single optimum window length doesn't exist; for
example, in speech processing it seems that medium
length windows are both too short in some respects and
too long in others.
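
The effect of the two approximations can be seen numerically. The sketch below evaluates the truncated sum for a Gaussian test signal, whose exact transform is known in closed form, using a long and a short record. The test signal, the sample spacing and the record lengths are illustrative assumptions, and the sum is multiplied by the spacing T so that it approximates the integral in absolute scale.

```python
import cmath
import math
from typing import Callable


def truncated_fourier_sum(f: Callable[[float], float], omega: float,
                          step: float, n_terms: int) -> complex:
    """T * sum_{n=-N/2}^{N/2} f(nT) exp(-i*omega*n*T), with T = step and N = n_terms."""
    half = n_terms // 2
    total = 0j
    for n in range(-half, half + 1):
        total += f(n * step) * cmath.exp(-1j * omega * n * step)
    return step * total


def gaussian(t: float) -> float:
    return math.exp(-0.5 * t * t)


if __name__ == "__main__":
    omega = 1.0
    exact = math.sqrt(2.0 * math.pi) * math.exp(-0.5 * omega * omega)
    long_record = truncated_fourier_sum(gaussian, omega, 0.1, 200)   # covers |t| <= 10
    short_record = truncated_fourier_sum(gaussian, omega, 0.1, 20)   # covers |t| <= 1
    print(abs(long_record - exact), abs(short_record - exact))
```

The long record reproduces the exact value closely, while the short record shows the error introduced by truncation alone, since the sample spacing is the same in both cases.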

The balance of this chapter will develop
techniques to circumvent the need for choosing a window
size. In the next section, we make a quick review of
the current fixed window length technique, which (for
reasons that will become evident) we will call constant
bandwidth technology. Second, a broader view of the
signal processing discipline is taken so that we may
understand some fundamental effects of the constant
bandwidth technology approximation procedure. This
fundamental discussion will show why the constant
bandwidth technology is, for speech processing, a poor
approximation to the ideal Fourier transform. The
third section uses the fundamental discussion to
generate a good approximation to the Fourier transform,
which we call constant-Q technology. Fourth follows a
list of applications of the constant-Q technology.


2. Constant Bandwidth Technology


Since sampled speech is a quasi-infinite sequence
of points, most speech processing techniques such as
LPC, Homomorphic, SABRE, etc. segment this sequence
into a series of (possibly overlapping) finite length
records called windows. Implicit within all these
segmenting techniques is a popular operation known as
the short-time spectrum. The (proto-sampled)
short-time spectrum is expressed as follows:

$$\hat{f}(\omega,t) = \int_{-\infty}^{\infty} f(\tau)\, h(t-\tau)\, e^{-i\omega\tau}\, d\tau$$

where f(τ) is the input speech signal, h(τ) is a
compact support (finite length) "windowing" function,
and $\hat{f}(\omega,t)$ is the output function, with ω indexing the
frequency and t indexing the center of the window.
This process has been known for a long time [1]. To
achieve computational effectiveness we need only
sample the input signal and convert the integral to a
sum. Since the window function has finite length this
sum is automatically finite, hence calculable.
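
A finite-sum version of the short-time spectrum might look like the following sketch. It assumes unit sample spacing, a raised-cosine window with an odd number of points, and a synthetic cosine input; none of these choices come from the report, and the function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def raised_cosine_window(length: int) -> List[float]:
    """Symmetric raised-cosine window; `length` should be odd."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (length - 1)) for k in range(length)]


def short_time_spectrum(signal: Sequence[float], window: Sequence[float],
                        omega: float, center: int) -> complex:
    """Sum over m of f(m) * h(center - m) * exp(-i*omega*m).

    window[k] holds h(k - half), so only 2*half + 1 terms are non-zero.
    """
    half = len(window) // 2
    total = 0j
    for m in range(center - half, center + half + 1):
        if 0 <= m < len(signal):
            total += signal[m] * window[center - m + half] * cmath.exp(-1j * omega * m)
    return total


if __name__ == "__main__":
    tone = [math.cos(0.5 * n) for n in range(400)]      # 0.5 rad/sample test tone
    window = raised_cosine_window(65)
    print(abs(short_time_spectrum(tone, window, 0.5, 200)))  # large: analysis at the tone frequency
    print(abs(short_time_spectrum(tone, window, 2.0, 200)))  # small: far from the tone frequency
```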

Interesting issues arise when one asks for an
"inverse" to this transform [2]. We can write the
inverse as follows:

$$f(t) = \frac{1}{2\pi \langle g,h \rangle} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(t-\tau)\, \hat{f}(\omega,\tau)\, e^{i\omega t}\, d\omega\, d\tau$$

where <g,h> is the Hilbert space L^2 inner product and g is
a suitable reconstruction function. (For a guide to
choosing g see [2].)


The following manipulation of the short-time
spectrum is important not only for theoretical insight
but also for practical applications. We rewrite
$\hat{f}(\omega,t)$ as follows:

$$\hat{f}(\omega,t) = \int_{-\infty}^{\infty} f(\tau)\, h(t-\tau)\, e^{-i\omega\tau}\, d\tau = e^{-i\omega t} \int_{-\infty}^{\infty} f(\tau)\, h(t-\tau)\, e^{i\omega(t-\tau)}\, d\tau$$

The last expression shows that for fixed ω, $\hat{f}(\omega,t)$
is the baseband demodulated output of a bandpass filter
with center frequency ω. Each filter has a bandwidth
determined only by h and is independent of its center
frequency. Hence the appellation constant bandwidth.
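
The filter-bank reading can be checked numerically. The sketch below computes the short-time spectrum directly and again as the output of a bandpass filter with impulse response h(n)e^{iωn} followed by demodulation by e^{-iωt}; the two agree to rounding error. The window shape, the test signal and the function names are illustrative assumptions.

```python
import cmath
import math
from typing import Sequence


def window(lag: int, half_width: int = 16) -> float:
    """Raised-cosine window h(lag), zero outside |lag| <= half_width."""
    if abs(lag) > half_width:
        return 0.0
    return 0.5 + 0.5 * math.cos(math.pi * lag / half_width)


def direct_spectrum(f: Sequence[float], omega: float, t: int) -> complex:
    """sum_m f(m) h(t - m) exp(-i*omega*m)."""
    return sum(f[m] * window(t - m) * cmath.exp(-1j * omega * m) for m in range(len(f)))


def filter_bank_output(f: Sequence[float], omega: float, t: int) -> complex:
    """Bandpass filter h(n)exp(i*omega*n) applied to f, then shifted to baseband."""
    bandpass = sum(f[m] * window(t - m) * cmath.exp(1j * omega * (t - m))
                   for m in range(len(f)))
    return cmath.exp(-1j * omega * t) * bandpass


if __name__ == "__main__":
    signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(200)]
    for omega, t in [(0.3, 50), (1.1, 120), (2.0, 75)]:
        difference = abs(direct_spectrum(signal, omega, t) - filter_bank_output(signal, omega, t))
        print(omega, t, difference)
```

Because the same h is used at every ω, every channel of this bank has the same bandwidth, which is the property the name refers to.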

Note that in order to present $\hat{f}(\omega,t)$ we need a
two-dimensional display (an image). This process then
allows two-dimensional processing techniques to be used
in speech processing.


For the above and many more reasons $\hat{f}(\omega,t)$ is
coming out of its implicit role in the old algorithms
to enjoy an explicit status in new algorithms.


3. Perception and Signal Processing


In order to understand the fundamental aspects of
windowing it is necessary to gain a broader perspective
on the relationship between perception and signal
processing. The question we address is: Why does the
Fourier transform (instead of the Mellin, Hankel, etc.)
enjoy such singular success in the signal processing
discipline? The answer is: Because the Fourier
transform is "matched" with the manner in which the
signals to be processed are produced and consumed.
Specifically, a speech signal is produced by convolving
a glottal wave with the vocal tract impulse response.
In turn, a speech signal is consumed by the human
auditory perception system. This criterion of matching
a transform with the signal production and consumption
mechanism is a natural one for engineers. The
two-dimensional time-frequency spectrograms which
display the $\hat{f}(\omega,t)$ of the short-time spectrum
correspond roughly to processes known to occur in the
human auditory system; this explains in part the
appeal of $\hat{f}(\omega,t)$ and its growing popularity. We shall
ignore the matching between the transform and the
signal production mechanism and concentrate on the
relationship between the transform and the signal
consumption mechanism. To explain what we mean
precisely by "matching" it is necessary to talk about
the theory of Lie group representations. This would
take us too far afield for the present exposition, so we
will attempt to paraphrase results. A particularly
powerful observation one can make about any given
system is the symmetry operations one can perform on
that system. For example, a symmetry operation
consonant with the auditory system is time translation,
since perception of a speech waveform is unaffected by
a pure time delay. However, the auditory system is not
subject to time reversal symmetry; reversal of a
speech waveform completely destroys intelligibility.
The set of symmetry operations can be concatenated and
inverted, i.e. they form a group. Using Lie group
theory one can deduce all transforms related to that
group. These "symmetric" transforms are called
equivariant or intertwining operators. For example,
the Fourier transform can be deduced solely on the
basis of the group of time translations above. This
then is what we mean by a matched transform and system:
the symmetry operations must be the same, as in the
case of the time translation group corresponding to
both the auditory system and the Fourier transform.

One can raise the question: Are there further
symmetries corresponding to the auditory system? There
are several clues from psychophysical studies. As we
mentioned above, the auditory system does process
speech into a time-frequency format. However,
experiments don't seem to uncover any basic window.
For example, under a constant bandwidth (fixed window)
regime the time resolution at all frequencies remains
the same. But the auditory system analyses high
frequencies with much finer time resolution than low
frequencies. Also, in modeling the auditory system as a
bank of tuned bandpass filters one must use (instead of
constant bandwidth filters) constant-Q filters, i.e.
filters with bandwidths a fixed percentage of their
center frequencies. These properties suggest that a
symmetry group of the auditory system is the group of
time scaling and shifting, the at+b group.
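
To make the constant-Q idea concrete, the sketch below lays out a bank of filters whose bandwidths are a fixed fraction 1/Q of their center frequencies, with adjacent passbands touching. The frequency range, the value of Q and the edge-to-edge spacing rule are illustrative assumptions, not values taken from the report.

```python
from typing import List, Tuple


def constant_q_bank(f_low: float, f_high: float, q: float) -> List[Tuple[float, float]]:
    """Return (center, bandwidth) pairs with bandwidth = center / q.

    Centers are spaced so the upper edge of one passband meets the lower
    edge of the next: next_center = center * (2q + 1) / (2q - 1).
    """
    if q <= 0.5:
        raise ValueError("q must exceed 0.5 for touching passbands")
    ratio = (2.0 * q + 1.0) / (2.0 * q - 1.0)
    bank = []
    center = f_low
    while center <= f_high:
        bank.append((center, center / q))
        center *= ratio
    return bank


if __name__ == "__main__":
    for center, bandwidth in constant_q_bank(100.0, 3200.0, 4.0):
        # Time resolution scales roughly as 1/bandwidth, so it sharpens with frequency.
        print(f"center {center:8.1f} Hz  bandwidth {bandwidth:7.1f} Hz")
```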

Simple calculus reveals that the Fourier transform
reacts gracefully to the at+b group; the Fourier
transform not only has a simple time translation
property but also a simple time scaling property. The
fundamental observation about the windowing
approximation is that it destroys the scaling property
enjoyed by the Fourier transform. In fact the symmetry
group corresponding to the constant bandwidth
technology is time translation and sinusoidal
modulation, i.e. frequency translation. That
frequency translation is not a symmetry of the auditory
system is readily apparent to anyone who has tried to
understand a mistuned single sideband speech signal.
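
The translation and scaling properties referred to above can be combined into one identity. The following short derivation is a sketch in the notation of the Fourier integral given in the Introduction, assuming a real scale factor a ≠ 0 and a real shift b (symbols chosen here to follow the name "at+b group"):

```latex
\begin{aligned}
\int_{-\infty}^{\infty} f(at+b)\, e^{-i\omega t}\, dt
  &= \frac{1}{|a|} \int_{-\infty}^{\infty} f(u)\, e^{-i\omega (u-b)/a}\, du
     \qquad (u = at + b) \\
  &= \frac{1}{|a|}\, e^{i\omega b/a}\, \hat{f}\!\left(\frac{\omega}{a}\right)
\end{aligned}
```

Scaling and shifting the signal therefore only rescales the frequency axis and multiplies the transform by a constant and a phase factor, which is the graceful behavior that a fixed-length window does not preserve.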
4. Constant-Q Technology

This section discusses a time-frequency transform
that corresponds to the at+b group. From Lie group
theory one obtains the following intertwining
operator:

■ C " (^)(^  '<«’  « 

To obtain a time-frequency analog we Fourier transform
along the first variable and combine expressions to
obtain:

.)  £(«.<:)  - r ^ 

where  h is  the  Fourier  transform  of  k and  serves  as  the 
window  function.  The  matter  of  an  inverse  is  more 
delicate  than  for  the  constant  bandwidth  case.  An 
inverse  can  be  shown  to  be: 




**)  f{t) 

where  k 


lin  — / r f {w»T)g(u)(t-T))e^“^lu)l  dxdu) 

^8  -2k  loge 

g(u)du  • h(u)e^  du* 


A number of notable features of *) and **) should be
mentioned. First, the window length of h changes for
each frequency ω so that no fixed window length need
be chosen. Second, the time resolution for the higher
frequencies is sharper than for the lower frequencies;
thus the constant-Q technology mimics the time
resolution properties of the auditory system. Third, a
manipulation similar to the constant bandwidth case
reveals a filter bank analogy. But in this case,
instead of identical bandwidth filters, the bandpass
filters increase in bandwidth as their center
frequencies increase, thus maintaining a fixed ratio of
center frequency to bandwidth (Q). Finally, the
constant-Q technology preserves symmetry under the at+b
group (since it was designed to do so), thus satisfying
our matching criterion between transform and perceptual
system.


5. Applications


Almost all speech processing algorithms implicitly
contain some version of the constant bandwidth
technology. To apply the constant-Q technology one
must make $\hat{f}(\omega,t)$ explicit and replace the constant
bandwidth with constant-Q. LPC is an excellent
example: when viewed as a spectral approximation
technique it is easily seen that $\hat{f}(\omega,t)$ is the object
that is actually being approximated by LPC. Recently
Makhoul [7] has found that an ad hoc smoothing of the
higher frequencies results in an elimination of the
"buzziness" in LPC speech. The constant-Q technology
supplies a built-in smoothing, so that this very
important quality issue in LPC is automatically solved
by constant-Q. This should not be surprising given the
fundamental matching criterion between transform and
auditory system. Further applications arise from an
AGC/blind deconvolution algorithm. The noise
reduction methods mentioned elsewhere in this report
can also be converted from constant bandwidth to
constant-Q technology. Since the present algorithms
utilize windowing it is not surprising, given the
fundamental criterion, that disturbing quality
degradations are inflicted upon material to be
restored.

Finally it should be noted that it is not known
whether clever digital implementations of the
constant-Q algorithm exist. Current digital versions
run quite slowly since they are implemented via fast
convolution as a filter bank. However, the fact that
the bandwidth of the filters increases with center
frequency allows approximately a tenfold decrease in
the number of filters relative to that required by the
old constant bandwidth technology. In addition, although
clever digital implementations are unknown, straightforward
CCD implementations class among the easiest and most
natural applications of CCDs. A single CCD bandpass
prototype can be used for all channels simply by fixing
the clock rate to be some multiple of the center
frequency.
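
The reduction in filter count can be illustrated with a rough count. The sketch below compares a constant bandwidth bank, whose channel width is set by the narrowest band needed at the low end, with a constant-Q bank covering the same range with touching passbands. The 100 Hz to 3200 Hz range and Q = 4 are illustrative assumptions, not figures from the report, and the resulting ratio depends on them.

```python
import math


def count_constant_bandwidth(f_low: float, f_high: float, bandwidth: float) -> int:
    """Number of centers f_low, f_low + B, f_low + 2B, ... not exceeding f_high."""
    return math.floor((f_high - f_low) / bandwidth) + 1


def count_constant_q(f_low: float, f_high: float, q: float) -> int:
    """Number of centers f_low * r**n not exceeding f_high, with r = (2q+1)/(2q-1)."""
    ratio = (2.0 * q + 1.0) / (2.0 * q - 1.0)
    return math.floor(math.log(f_high / f_low) / math.log(ratio)) + 1


if __name__ == "__main__":
    q = 4.0
    f_low, f_high = 100.0, 3200.0
    narrowest = f_low / q   # bandwidth of the lowest constant-Q channel
    n_cb = count_constant_bandwidth(f_low, f_high, narrowest)
    n_cq = count_constant_q(f_low, f_high, q)
    print(n_cb, n_cq, n_cb / n_cq)
```

With these assumed values the constant bandwidth bank needs roughly nine times as many channels, which is of the same order as the tenfold figure quoted above.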
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